A crucial issue with current text generation models is that they often uncontrollably generate text that is factually inconsistent with their inputs. Limited by the lack of annotated data, existing work on evaluating factual consistency directly transfers the reasoning ability of models trained on other data-rich upstream tasks, such as question answering (QA) and natural language inference (NLI), without any further adaptation. As a result, these methods perform poorly on real generated text and are heavily biased by their single-source upstream tasks. To alleviate this problem, we propose a weakly supervised framework that aggregates multiple resources to train a precise and efficient factual metric, namely WeCheck. WeCheck first uses a generative model to accurately label a real generated sample by aggregating its weak labels, which are inferred from multiple resources. Then, we train the target metric model with this weak supervision while taking noise into consideration. Comprehensive experiments on a variety of tasks demonstrate the strong performance of WeCheck, which achieves a 3.4% absolute improvement on average over previous state-of-the-art methods on the TRUE benchmark.
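The two steps described in this abstract (aggregate several weak labels into one soft label, then train while accounting for label noise) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the WeCheck implementation: the abstract says a generative model performs the aggregation, whereas this sketch substitutes a plain mean, and the function names, the example scores, and the `margin` heuristic for discarding uncertain samples are all hypothetical.

```python
from statistics import mean


def aggregate_weak_labels(weak_scores):
    """Combine per-source consistency scores in [0, 1] into one soft label.

    Assumption: a plain mean stands in for the generative labeling model.
    """
    if not weak_scores:
        raise ValueError("need at least one weak label")
    return mean(weak_scores)


def filter_noisy(samples, margin=0.2):
    """Keep (text, soft_label) pairs whose label is at least `margin` from 0.5.

    Hypothetical noise-handling heuristic: drop samples the sources disagree on.
    """
    return [(text, p) for text, p in samples if abs(p - 0.5) >= margin]


# Hypothetical weak scores, e.g. from a QA-based and two NLI-based metrics.
soft = aggregate_weak_labels([0.9, 0.8, 0.7])
kept = filter_noisy([("summary A", soft), ("summary B", 0.55)])
```

The retained pairs could then serve as soft targets when training the metric model; how noise is actually modeled in WeCheck is not specified in the abstract.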
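Figures of speech, such as metaphor and irony, are ubiquitous in literary works and colloquial conversation. They pose a great challenge for natural language understanding, because figures of speech usually deviate from their surface meaning to express deeper semantic implications. Previous research emphasizes the literary aspect of figures and seldom provides a comprehensive exploration from the viewpoint of computational linguistics. In this paper, we first propose the concept of a figurative unit, which is the carrier of a figure. We then select 12 types of figures commonly used in Chinese and build a Chinese corpus for Contextualized Figure Recognition (ConFiguRe). Unlike previous token-level or sentence-level counterparts, ConFiguRe aims to extract a figurative unit from discourse-level context and to classify the figurative unit into the correct figure type. On ConFiguRe, three tasks are designed, namely figure extraction, figure type classification, and figure recognition, and state-of-the-art techniques are used to implement the benchmarks. We conduct thorough experiments and show that all three tasks are challenging for existing models and therefore require further research. Our dataset and code are publicly available at https://github.com/pku-tangent/configure.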
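With the rapidly increasing number of videos, there is great demand for techniques that help people quickly navigate to the video segments they are interested in. However, current work on video understanding mainly focuses on summarizing video content, and little effort has been made to explore the structure of a video. Inspired by textual outline generation, we introduce a novel video understanding task, namely video outline generation (VOG). The task is defined to contain two sub-tasks: (1) first segmenting the video according to its content structure, and then (2) generating a heading for each segment. To learn and evaluate VOG, we annotate a dataset of more than 10K samples, called Duvog. Specifically, we use OCR tools to recognize the subtitles of videos. Annotators are then asked to divide the subtitles into chapters and to give each chapter a title. In videos, highlighted text tends to be the heading, because it is more likely to attract attention. We therefore propose a Visual Subtitle feature Enhanced video outline generation model (VSENET), which takes the textual subtitles together with their visual font sizes and positions as input. We treat the VOG task as a sequence tagging problem that extracts the spans where the headings are located and then rewrites them to form the final outline. Furthermore, based on the similarity between video outlines and textual outlines, we use a large number of articles with chapter headings to pretrain our model. Experiments on Duvog show that our model outperforms other baseline methods by a large margin, achieving an F1 score of 77.1 at the video segmentation level and a ROUGE-L_F0.5 of 85.0 at the heading generation level.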
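An outstanding image-text retrieval model depends on high-quality labeled data. Although the builders of existing image-text retrieval datasets strive to ensure that each caption matches its linked image, they cannot prevent a caption from also fitting other images. We observe that this many-to-many matching phenomenon is quite common in widely used retrieval datasets, where one caption can describe up to 178 images. This large amount of missing match data not only confuses the model during training but also weakens evaluation accuracy. Inspired by visual and textual entailment tasks, we propose a multi-modal entailment classifier to determine whether a sentence is entailed by an image together with its linked captions. Subsequently, we revise the image-text retrieval datasets by adding these entailed captions as additional labels of an image, and we develop a universal variable learning rate strategy to teach a retrieval model to distinguish the entailed captions from other negative samples. In the experiments, we manually annotate an entailment-corrected image-text retrieval dataset for evaluation. The results show that the proposed entailment classifier achieves about 78% accuracy and consistently improves the performance of image-text retrieval baselines.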
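Aspect Sentiment Triplet Extraction (ASTE) aims to extract triplets from a sentence, each consisting of a target entity, the associated sentiment polarity, and the opinion span that justifies the polarity. Existing methods fall short in building correlations between target-opinion pairs and ignore the mutual interference among different sentiment triplets. To address these issues, we use a two-stage framework to enhance the correlation between targets and opinions. In stage one, targets and opinions are extracted through sequence tagging. We then append a group of artificial tags, named Perceivable Pair, which indicate the span of a specific target-opinion tuple, to the input sentence in order to obtain more closely correlated target-opinion pair representations. Meanwhile, we reduce the negative interference between triplets by restricting the attention field of tokens. Finally, the polarity is identified according to the representation of the Perceivable Pair. We conduct experiments on four datasets, and the experimental results show the effectiveness of our model.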
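Automatically solving math word problems is a critical task in the field of natural language processing. Recent models have reached a performance bottleneck and require more high-quality training data. We propose a novel data augmentation method that reverses the mathematical logic of math word problems to produce new high-quality math problems and to introduce new knowledge points that benefit the learning of mathematical reasoning logic. We apply the augmented data to two SOTA math word problem solving models and compare our results with a strong data augmentation baseline. Experimental results show the effectiveness of our approach. We release our code and data at https://github.com/yiyunya/roda.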
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
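As a rough illustration of the NAIVEATTACK idea mentioned above (adding triggers to the raw data before distillation starts), the sketch below stamps a fixed patch onto a fraction of the images. It is a minimal sketch under stated assumptions, not the authors' code: the bottom-right square patch, the patch size, the poisoning ratio, the relabeling to a target class (the usual backdoor convention, not stated in the abstract), and the function names `add_trigger` and `poison_dataset` are all hypothetical. The distillation step itself is not shown.

```python
import random


def add_trigger(image, patch_size=2, value=1.0):
    """Return a copy of a 2D grayscale image (list of rows) with a square
    patch of `value` stamped in the bottom-right corner (assumed trigger)."""
    h, w = len(image), len(image[0])
    if patch_size > h or patch_size > w:
        raise ValueError("patch larger than image")
    stamped = [row[:] for row in image]  # copy so the input stays untouched
    for r in range(h - patch_size, h):
        for c in range(w - patch_size, w):
            stamped[r][c] = value
    return stamped


def poison_dataset(images, labels, target_label, ratio=0.1, seed=0):
    """Stamp the trigger on a random `ratio` of samples and relabel them
    to `target_label`; all other samples are copied unchanged."""
    rng = random.Random(seed)
    n_poison = int(len(images) * ratio)
    chosen = set(rng.sample(range(len(images)), n_poison))
    new_images, new_labels = [], []
    for i, (img, lab) in enumerate(zip(images, labels)):
        if i in chosen:
            new_images.append(add_trigger(img))
            new_labels.append(target_label)
        else:
            new_images.append([row[:] for row in img])
            new_labels.append(lab)
    return new_images, new_labels


# Toy data: ten blank 4x4 images, all labeled 0; poison 20% toward class 7.
clean = [[[0.0] * 4 for _ in range(4)] for _ in range(10)]
poisoned_images, poisoned_labels = poison_dataset(clean, [0] * 10, 7, ratio=0.2)
```

In this reading, the poisoned set would be handed to the distillation procedure in place of the clean data. DOORPING, by contrast, is described as updating the trigger throughout distillation, which this static sketch does not capture.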
Automatic music generation with artificial intelligence typically requires a large amount of data, which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) show no such ability beyond naive repetition. Evaluating generated music is a challenging task, and evaluating drum grooves is even more so, given the little precedent in the literature. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes given only a few support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support and query features based on a Transformer-like framework. Our key insights are twofold. First, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice, at the feature level and at the instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After these steps, performance on the novel classes improves significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarked on the COCO dataset for the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shots, e.g., we boost nAP by a noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.